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Selection pressures on proteins are usually measured by comparing homologous nu- 
cleotide sequences |57j . Recently we introduced a novel method, termed 'volatility', 
to estimate selection pressures on protein sequences from their synonymous codon 
usage [3H1 EH]- Here we provide a theoretical foundation for this approach. We 
derive the expected frequencies of synonymous codons as a function of the strength 
of selection, the mutation rate, and the effective population size. We analyze the 
conditions under which we can expect to draw inferences from biased codon usage, 
and we estimate the time scales required to establish and maintain such a signal. 
Our results indicate that, over a broad range of parameters, synonymous codon 
usage can reliably distinguish between negative selection, positive selection, and 
neutrality. While the power of volatility to detect negative selection depends on 
the population size, there is no such dependence for the detection of positive se- 
lection. Furthermore, we show that phenomena such as transient hyper-mutators 
in microbes can improve the power of volatility to detect negative selection, even 
when the typical observed neutral site heterozygosity is low. 

1 Introduction 

Nucleotide coding sequences of many organisms exhibit significant codon bias - that is, 
unequal usage of synonymous codons. Codon bias has been attributed both to neutral pro- 
cesses, such as asymmetric mutation rates, as well as to selection acting on the synonymous 
codons themselves. The most common selective explanation of codon bias posits that syn- 
onymous codons differ in their fitness according to the relative abundances of iso-accepting 
tRNAs; a codon corresponding to a more abundant tRNA would be used preferentially so 
as to increase translational efficiency j2Hl HH EH] ■ To a large extent, this hypothesis has suc- 
cessfully explained interspecific variation in genome-wide codon usage for organisms ranging 
from Escherichia coli to Drosophila melanogaster pQ. 

Recently, however, we have noted that codon bias in a protein sequence can also result 
from selection at the amino acid level, even in the absence of direct selection on synonymous 
codons themselves [SHI EH]- Codon bias arises from selection at the amino acid level because of 
asymmetries in the structure of the standard genetic code. Proteins that experience different 
selective regimes should exhibit different synonymous codon usage. Following from this 
observation, we have introduced methods to screen a single genome sequence for estimates 
of the selection pressures acting on its proteins by comparing their synonymous codon usage 

PS|. 
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In this paper, we provide a theoretical discussion of codon usage biases that result from 
selection at the amino acid level. Our analysis helps to provide a theoretical grounding for 
techniques of estimating selection pressures on proteins using signals gathered from their 
synonymous codon usage [33 [HH] - Throughout most of this paper, we will ignore any source 
of direct selection on synonymous codons, and focus on the codon biases that result purely 
from selection at the amino acid level. To the extent that any other sources of codon bias 
apply equally across the genome, we have devised a bootstrap method to control for these 
external sources of codon bias when estimating selection pressures on proteins [HUES!- I n 
the discussion, however, we describe a range of confounding factors that may vary across 
the genome in some organisms and limit the applicability of codon-based methods to detect 
selection. 



Codon usage biases can arise from the familiar process of selection on proteins because 
synonymous codons may differ in their volatility - defined, loosely, as the proportion of a 
codon's point mutations that result in an amino acid substitution j3H|- Although there are 
several possible definitions of volatility, which can all be informative, we have recently used 
the following formal definition [3*§j . 

We index the 61 sense codons in an arbitrary order % — 1 . . .61. We use the notation 
aa(i) to denote the amino acid encoded by codon i. For each codon i, let B{i) denote the set 
of sense codons that differ from codon i by a single point mutation. We define the volatility 
of codon i by: 



where D denotes the Hamming metric, which is zero if two amino acids are identical, and one 
otherwise. The definition in Eq. Q applies when all nucleotide mutations occur at the same 
rate. When differential nucleotide mutation rates are known (e.g. a transition/transversion 
bias jUj), these rates can be incorporated into the definition of codon volatility by appro- 
priately weighting the ancestor codons 39j . 

Minor variants of Eq. Qyield related definitions of codon volatility. For some applications, 
one may want to allow termination codons in the definition of B(i). It is also natural to 
consider alternatives to the Hamming metric, D, that weight substitutions between amino 
acids depending upon the differences in their stereochemical properties [SHUSH!- A variety of 
other metrics [JH EH] that reflect the effects of different amino acid substitutions on protein 
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structure may likewise be incorporated into the definition of codon volatility. In this paper, 
however, we will focus on the most basic definition of codon volatility (Eq. ^ using the 
Hamming metric), because variant definitions are based on the same underlying principle 
and produce similar results in practice |3*H| . 

Under the most basic definition of volatility, there are four amino acids (Glycine, Leucine, 
Arginine, and Serine) whose codons differ in their volatility. As a result, when controlling 
for amino acid content, we obtain a volatility signal from only those sites that contain one of 
these four amino acids - which amounts to about 30% of the sites in a typical gene. (If one 
uses stereochemical metrics |3*3*1 I3*H] for D in the definition of volatility, then ~ 75% of the 
sites in a gene contain a volatility signal). Although 30% may seem like a small proportion 
of sites from which to obtain a signal of selective pressures, it is larger than the proportion 
of sites often used to detect selection via sequence comparison of recently diverged species 
|14[ |H] . (For example, fewer than 4% of neutral sites exhibit substitutions when comparing 
human and chimpanzee sequences jHj.) 

In the following sections we analyze the consequences of selection on proteins for codon 
usage in general, as well as for the volatility measure in particular. We demonstrate that the 
expected codon usage at a site, as well as its temporal dynamics, depend upon the strength 
of positive or negative selection on the amino acid sequence. In Sections H3 through |5] we 
examine negative selection in infinite and finite populations. In Section El we discuss positive 
selection. Our analysis is initially confined to the patterns of codon usage at a single site 
under selection at the amino acid level. Proceeding from this analysis, we also discuss codon 
usage over many sites within a gene or genome, and analyze how many sites are required in 
principle to detect a reliable signal of selection by inspecting synonymous codon usage. 

3 Negative Selection and Codon Bias in an Infinite 
Population 

Most nonsynonymous mutations in a protein coding sequence presumably reduce the fitness 
of an organism. For a large proportion of sites, therefore, natural selection opposes any 
change in the amino acid. We refer to this type of selection as "negative selection." 

For the purposes of exploring the effect of negative selection on codon usage, we assume 
that selection cannot discriminate between the synonymous codons for the favored amino acid 
at a site. However, mutations are more likely to be nonsynonymous, and hence deleterious, 
if the codon at that site has high volatility. As we will show, this fact results in an effective 
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preference for the less volatile codons, among those codons that code for the favored amino 
acid at the site. We emphasize that this preference for a codon of low volatility at a site under 
negative selection is not caused by a direct fitness difference between synonyms. Rather, more 
volatile codons will occur less frequently as a second-order consequence of negative selection 
at the amino acid level, and the structure of the genetic code. 

Proteins with a larger number of sites under negative selection will exhibit a statistical 
bias towards less volatile codons, after controlling for their amino acid content. Here we 
calculate the expected magnitude of the codon bias as a function of the mutation rate, the 
strength of negative selection, and, in Section HI the population size. We also analyze the 
conditions under which we can expect to detect and draw inferences from this bias, and we 
estimate the time scales needed to establish and maintain such a signal. 

3.1 A simplified genetic code 

In an infinite population, we can describe the dynamics of codon usage at an individual 
site by using the standard multi-allele model first introduced by Haldane [THj and used 
throughout the literature {e.g. ref. [33] Eq. 2.25 or ref. |21j). This model describes a single 
site which can assume any of K states. In order to investigate codon usage, we consider 
K = 64 states, corresponding to each of the 64 possible codons. In continuous time, the 
frequency Xi of individuals with codon i evolves according to 

^^^(^-^(i) (2) 
j'=i 

where Wj is the Malthusian fitness of codon j, W(t) = Yjj w j x j(t) * s the mean fitness 
of the population, and My is the instantaneous rate of mutation from codon j to codon 
i, with Ylj Mij = 0. Although Eq. El is non-linear, the equilibrium frequencies of the 
"alleles" i — 1,2, ... K are given by the leading eigenvector of the matrix w^M^ [38] • These 
frequencies determine the expected equilibrium codon usage at a site. For the purposes of 
this paper, alternative formulations of the .fT-allele model that treat the processes of selection 
and mutation separately (e.g. ref [TU] Eq. 6.4.1) yield the exact same results. 

The equilibrium solution to Eq. EJfor the full genetic code does not lend itself to intuitive 
understanding. Transient dynamics are also difficult to calculate in this high- dimensional 
system. Therefore, in order to highlight the essential points of our analysis, we first consider 
a "toy" genetic code that retains those features of the true genetic code relevant to the study 
of synonymous codon usage under negative selection. As we will demonstrate, the solution 
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for the simplified genetic code yields a complete understanding for the full genetic code as 
well. 

We imagine a simplified genetic system with only three possible codons, a%, a 2 , and b. 
Codons a% and a 2 code for amino acid A, which is favored, and codon b encodes amino acid 
B, which has selective disadvantage cr. We assume that mutations occur at rate u between 
these codons according to the structure 



a 1 



a 2 



so that of the two synonymous codons, a 2 is more volatile. 

According to the standard multi-allele model (Eq. EJ), the relative frequencies of codons 
a\, a 2 , and b are described by the equation 
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ai(t) +a 2 (t) + (l-a)b(t). 
The equilibrium frequencies of codons are given by the leading eigenvector of the matrix 
in Eq. |H1 A simple perturbation analysis of this eigenvector shows that the equilibrium 
frequency of a\ depends monotonically on cr, and it exhibits a sharp transition between two 
regimes: the weak selection regime u <C u and the strong selection regime o ^> u. In the 
weak selection regime, the equilibrium relative frequencies of synonyms are given by the 
expansion 

d\ 



1 la n o' 



a x + a 2 

And in the strong selection regime, the equilibrium relative frequencies are given by 



a 1 



V5-1 (5-2v / 5)(l 
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(4) 



(5) 



a 1 + d 2 2 5 

In the absence of selection (a = 0) all three codons occur with equal frequency, as we 
would expect. In particular, the relative frequency of the two synonymous codons at and 
a 2 equals |, regardless of the mutation rate. For weak selection (cr <^ u), this result is still 
approximately true, according to the perturbation expansion above. In the case of strong 
negative selection (cr ^> u), the relative frequency of the two synonymous codons is given 
approximately by the inverse of the golden mean, v/ ^~ 1 w 0.62. 

The sharp transition between the weak and strong selection regimes defines cr = u as a 
critical value for negative selection. For cr u negative selection is ineffective at favoring 
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the less volatile codon, and the site is effectively neutral. But when a ^> u, negative selection 
favors the less volatile codon, and the magnitude of this effect depends only weakly on the 
value of a. This is an essential point. In the strong selection regime, the magnitude of 
negative selection is relatively unimportant; volatile codons are disfavored at all sites where 
o 3> u. The transition between the weak and strong selection regimes is shown in Fig. 




Strength of selection (a) 

Figure 1: The relationship between selection at the amino acid level and resulting synony- 
mous codon usage. The graph shows relative equilibrium frequency of synonymous codons, 
&\j [d\ + d 2 ), as a function of the strength of negative selection, a. The relative frequency of 
codon a\ is approximately ~ in the weak selection regime (a <C u) , and approximately % l 
in the strong selection regime (a ^> u). In this figure u = 1CT 5 . 



3.2 The effective disadvantage of a volatile codon 

The critical value of a discussed above can be understood intuitively by considering the 
"effective selective disadvantage" of the more volatile codon a 2 that results indirectly from 
its volatility. We will use the notion of an "effective selective disadvantage" to aid in our 
analysis of codon usage at a site under negative selection. But we emphasize that our model 
(Eq. |2J) does not assume any direct fitness difference between synonymous codons. 

When the disfavored amino acid B is lethal to the organism, then the effective selective 
disadvantage of codon a 2 is particularly simple to understand. In this case, individuals with 
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codon a 2 are removed from the population at rate u because they mutate to the lethal codon 
b, but receive no back-mutations. Hence the effective selective disadvantage, denoted s, of 
codon a 2 versus codon a\ is given by s = u. The effective selective disadvantage of a 2 does 
not arise from a fitness difference between synonyms, but rather from selection at the level 
of amino acids and the structure of the genetic code. 

When amino acid B is not lethal the situation is slightly more complicated. Nevertheless, 
for o 3> it, mutations from a 2 to b typically die due to negative selection before they mutate 
back from b to a 2 . As a result, the effective selective disadvantage will still be s = u 
in the regime of strong selection. We can make this argument concrete by considering 
the mutation-selection balance between codon b and codon a 2 . According to the standard 
mutation-selection balance, the equilibrium frequency of codon b relative to codon a 2 equals 
- in the regime a S> u. Thus for each mutant from a 2 to b, there are at most of order - 
mutations from b to a 2 . The net mutation rate from a 2 to b is therefore u (l — -). This is 
the rate at which individuals of type a 2 are lost from the population due to the fact that 
a 2 is more volatile than a\. Thus the effective selective disadvantage of a 2 relative to a\ is 
given by s — u (l — -). By definition, in the strong selection regime we neglect - compared 
to 1, and the effective selective disadvantage of codon a 2 is simply s = u. 

A similar argument holds for the real genetic code. In this case, the favored amino acid 
may correspond to several synonymous codons, each with a potentially different volatility. 
However, the effective selective disadvantage, s, of a more volatile codon relative to a less 
volatile synonym is simply the difference in the number of mutations leading to a disfavored 
codon (a ^> u) times |, where u is the nucleotide mutation rate. (Note that | is the rate 
of mutation between any two particular nucleotides.) For example, when considering the 
relative frequencies of codons AGA and CGG at a site under negative selection for Argi- 
nine, AGA has selective disadvantage s — |u compared to CGG, since AGA has two more 
disfavored neighbors than CGG. By using the value of the effective selective disadvantage, 
s, we can calculate the equilibrium relative frequency of any pair of synonymous codons 
in mutation-selection balance, and thereby deduce the relative frequencies of all synonyms. 
Therefore, we can predict synonymous codon usage in the genetic code without resorting to 
the full solution of Eq. EJ 

An analogous argument can be used to calculate the effective selective disadvantage of 
codon a 2 in the regime of weak selection (a <^ u). In this regime, the relative equilibrium 
frequency of codon b versus codon a 2 equals 1 — j£r- Thus, the effective selective disadvantage 
of a 2 versus a\ is approximately s = 0, plus a small correction of order a. In other words, 
when a <Cu selection between a>i and a 2 is effectively neutral; it cannot generate codon bias. 
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We therefore refer to the regime a C a as the "almost neutral regime." This result holds 
both for the simplified three-codon model and for the real genetic code. 

It is also important to calculate the amount of time required to reach equilibrium codon 
usage in the presence of strong negative selection. Explicit solution of Eq. El assuming 
a ^> u, indicates that the e-fold relaxation time is of order - (the selection coefficient is 
s ~ u, and so the time scale for population sizes to change under selection is of order - ~ i). 
In other words, starting from any initial frequencies ai(0) and 02(0), these frequencies will 
become e-fold closer to their equilibrium values after a duration of order - generations. The 
same time scale holds for almost neutral sites (o <C u) and for the real genetic code^. In 
practice, u will be quite small, and equilibrium volatility is approached very slowly. We will 
revisit this point when we discuss finite populations, and again when we discuss positive 
selection. 



3.3 A specific example: selection for Arginine 



In this section we consider a simple example that demonstrates how our analysis applies to 
the real genetic code. We use Eq. |2]to model the dynamics of K = 64 alleles corresponding 
to the 64 codons, indexed in an arbitrary order. For our example, we consider a single site 
under negative selection for an Arginine codon. In this case we define 



M, 
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1 — 3u, if i—j 

u/3, if % and j differ by a point mutation 
0, otherwise 



(6) 



where u is the nucleotide mutation rate. We define 



W; 



1, if i encodes Arginine 

1 — cr, Hi encodes a non-Arginine amino acid 
1 — 7, if % encodes stop 



(7) 



so that a codon encoding an amino acid other than Arginine has fitness 1 — o, and a termi- 
nation codon has fitness 1 —7. We analyze this model numerically by calculating the leading 
eigenvector of the matrix WjMij, which yields the equilibrium frequencies of all 64 codons. 

'For <7<Cu, the process is almost neutral and the time scale calculation of Section fOl applies . The real 
genetic code has the same dynamics because we still have s ~ u for a 3> u and neutral behavior for a <C u. 
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In the case of no selection (a = 7 = 0), we find that all codons occur with the same 
equilibrium frequency, independent of mutation rate, as we would expect. For almost neutral 
selection ((t~7<m), codon usage is still approximately uniform. In the opposite case when 
Arginine is favored and all other amino acids (or termination codons) are strongly disfavored 
(i.e. a ~ 7 > u), the Arginine codons CGA, CGG, CGC, CGT, AGA, and AGG occur 
with equilibrium relative frequencies ps 0.214 : 0.214 : 0.191 : 0.191 : 0.095 : 0.095. As 
expected, under negative selection the more volatile Arginine codons occur with lower relative 
frequency in equilibrium. 

The equilibrium frequencies of Arginine codons determine the expected volatility at a 
single Arginine site under negative selection. Assuming free recombination [12], an individual 
gene consists of many such sites randomly assembled; the mean and standard deviation in 
the volatility (per site) of a randomly sampled gene are shown in Fig. as a function of 
the strength of negative selection a. Note that the stronger the negative selection, the lower 
the expected equilibrium volatility. The expected volatility exhibits a sharp transition from 
high to low values when the strength of negative selection a reaches the mutation rate u, 
as discussed above. On either side of this transition, the volatility is insensitive to a. The 
standard deviations plotted in Fig. |21 correspond to a gene comprised of L = 200 such sites, 
each modeled independently by the multi-allele equation. 

According to Fig. L = 200 independent sites that each experience neutrality (a <C u) 
can be distinguished on the basis of their volatility from L = 200 sites that experience 
negative selection (a 3> u). The difference in the expected volatility between these two 
regimes is greater than four standard deviations of the volatility within either regime. 

In reality, the selective constraint a will vary greatly across the sites of a given protein. 
In this case, disregarding the possibility of positive selection, the volatility of a gene (after 
controlling for its amino acid sequence) essentially reflects the relative number of informative 
sites that experience negative selection versus neutrality. For example, the volatility of gene 
X that contains L = 200 informative sites under negative selection and an equal number of 
neutral sites will be significantly greater (with a Z-score of about three) than the volatility 
of gene Y that consists of 2L informative sites all under negative selection. A more thorough 
discussion of variable selection pressures across genes is described in Section 14. 1[ below. 

Table ^ shows the equilibrium relative frequencies of synonymous codons for each of the 
informative amino acids (G, L, R, and S) under neutrality versus various selective regimes. In 
TableHJwe assume, as we do throughout this manuscript, that volatility is measured using the 
Hamming metric and that there is no transition/transversion bias. Corresponding values for 
different metrics or including a mutational bias may be calculated using the same approach. 
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10" 8 10" 6 10" 4 10" 2 1 

Strength of selection (a) 

Figure 2: The relationship between selection and volatility for a gene comprised of L = 200 
freely recombining sites under selection for Arginine. The graph shows expected volatility 
per site in the gene (±1 standard deviation, dashed) as a function of the strength of negative 
selection, a. The nucleotide mutation rate is u = 10~ 5 . The expected volatility is significantly 
depressed in the regime of strong negative selection, a ^> u. (For this figure we assume 7 = 1; 
virtually identical results hold for 7 = a.) 

As seen in Table ^ the difference in the expected volatility between selective regimes is 
least extreme (indeed, barely informative) for Glycine sites. The volatility difference is 
most extreme for serine sites: the highly volatile codons AGT and AGC are not expected 
to occur at a site under negative selection, but they preferentially occur at a site under 
positive selection. This extreme case results from the fact that codons AGT and AGC are 
not connected by synonymous point mutations to the other serine codons. This situation 
does not imply that codons AGT and AGC should be treated separately from the other 
serine codons. In fact, when treated as an entire group, the serine codons are particularly 
informative for positive selection (Table Q)- 

4 Negative Selection in a Finite Population 

The models presented in Section El describe the processes of mutation and negative selection 
in an infinite population. In finite populations, however, genetic drift also affects allelic 
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Neutral 


Neutral* 


Negative 


Positive 


V 


Leucine 












eta 


0.16667 


0.17300 


0.21353 


0.14213 


5/9 


etc 


0.16667 


0.18580 


0.19098 


0.17056 


6/9 


ctg 


0.16667 


0.17890 


0.21353 


0.14213 


5/9 


ctt 


0.16667 


0.18580 


0.19098 


0.17056 


6/9 


tta 


0.16667 


0.12990 


0.09549 


0.18274 


5/7 


ttg 


0.16667 


0.14650 


0.09549 


0.19188 


6/8 


EM 


0.65146 


0.64590 


0.63172 


0.65978 




a[u] 


0.07362 


0.07259 


0.07022 


0.07217 




Arginine 












aga 


0.16667 


0.15210 


0.09549 


0.19149 


6/8 


agg 


0.16667 


0.17050 


0.09549 


0.19859 


7/9 


cga 


0.16667 


0.15210 


0.21353 


0.12766 


4/8 


cgc 


0.16667 


0.17740 


0.19098 


0.17021 


6/9 


egg 


0.16667 


0.17050 


0.21353 


0.14184 


5/9 


cgt 


0.16667 


0.17740 


0.19098 


0.17021 


6/9 


em 


0.65278 


0.65400 


0.62592 


0.66766 




a[u) 


0.09854 


0.09660 


0.09354 


0.09528 




Serine 












asrc 


0.16667 


0.18510 


0.00000 


0.20636 


8/9 


agt 


0.16667 


0.18510 


0.00000 


0.20636 


8/9 


tea 


0.16667 


0.13440 


0.25000 


0.13265 


4/7 


tec 


0.16667 


0.17190 


0.25000 


0.15477 


6/9 


teg 


0.16667 


0.15162 


0.25000 


0.14510 


5/8 


tct 


0.16667 


0.17190 


0.25000 


0.15477 


6/9 


— rfTT — 1 

EM 


0.71792 


0.72981 


0.63243 


0.73970 




M 


0.12504 


0.12561 


0.03913 


0.12847 




Glycine 












gga 


0.25000 


0.22460 


0.25000 


0.23810 


5/8 


ggc 


0.25000 


0.26180 


0.25000 


0.25397 


6/9 


ggg 


0.25000 


0.25170 


0.25000 


0.25397 


6/9 


ggt 


0.25000 


0.26180 


0.25000 


0.25397 


6/9 


em 


0.65625 


0.65724 


0.65625 


0.65675 




o\v\ 


0.01804 


0.01859 


0.01804 


0.01775 





Table 1: Equilibrium codon usage under neutrality versus selective regimes. In each selective 
regime, we report the equilibrium relative abundance of codons, and the resulting mean and 
standard deviation in volatility per site. The first column corresponds to neutrality (a = 
7<Cm); the second column corresponds to neutrality but with disfavored termination codons 
{a <C u, 7 = 1); the third column corresponds to strong negative selection in an infinite 
population (er 3> u, 7 ^> u); the fourth column corresponds to the expected frequencies 
after a positively selected sweep (see Section EJ. The final column gives the volatility of each 
codon, assuming no transition/transversion bias 
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frequencies. In this section, we study the combined effects of mutation, negative selection, 
and drift, which we analyze using diffusion equations. These equations can be very complex. 
A full treatment of even the simplified three-codon genetic code requires a two-dimensional 
diffusion process, and the real genetic code involves a 63-dimensional process. To make 
this problem tractable, we use the notion of the "effective selective disadvantage" of more 
volatile codons, as discussed above. This allows us to consider the dynamics only at the 
favored codons, thereby reducing the dimensionality of the diffusion process. 

The neutral (a = 0) or almost neutral (a <C u) regimes are straightforward: here all 
synonymous codons for the favored amino acid have the same effective fitness. In this regime, 
each synonymous codon occurs with the same probability in steady state, independent of 
population size. 

For the remainder of this section, we analyze the case of strong negative selection (cr ^> u) 
at a single site. We consider a diffusion approximation to the process of mutation, selection, 
and drift operating only on the synonymous codons, to each of which we assign an effective 
selective coefficient. For the simplified three-codon genetic system, the more volatile codon 
Gt2 has an effective selective disadvantage of s = u compared to codon a\. For the real genetic 
code, more volatile codons will have a selective disadvantage of this order, but the precise 
value of s will depend on the specific amino acid in question. In the following analysis, we 
consider the case of the simplified three-codon system. However, we do not explicitly make 
the substitution s = u, so that our results can also be applied (with a slightly different value 
of s) to the real genetic code. 

The time-dependent frequency f(x,t) of allele ai relative to allele a 2 can be described by 
the Komolgorov forward equation j2l] 



= -^{a(x)f(x, t)} + l^{b(x)f(x, t)} (8) 

where the instantaneous mean and variance in the change of allelic frequency are given by 

a(x) = sx(l — x) — ux + u(l — x) 
b(x) = x(l-x)/N. 

The stationary distribution of allele frequencies f(x) satisfies the equation 

±{b{x)f{x)} = 2a(x)f(x) (9) 

which has the solution [5^] 

f{x) = Cx e - 1 (l-x) 9 - 1 e Sx (10) 
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where 9 = 2Nu, S = 2Ns, and C is chosen so that f(x)dx = 1. Since s ~ u (and thus 
5 ~ 9), the shape of the stationary the distribution f(x) falls into two categories: a bell- 
shaped distribution in the regime 9 > 1, and a U-shaped distribution in the regime 9 < 1. 
In other words, for > 1 the steady-state population is typically polymorphic at the locus, 
much like the infinite population mutation-selection balance. Whereas for 9 < 1 the steady- 
state population is usually near-monomorphic at the locus, occasionally switching between 
alleles a\ and a 2 , with a bias (whose strength is determined by S) towards allele a\ . 
In stationary state, the expected frequency of allele ai is given by 

where B(x, y) is the modified Bessel function of the first kind. Similarly, the variance in the 
frequency of allele a\ is given by 

V(9,S) = [ x 2 f(x)dx- M(9) 2 (12) 
Jo 

1 29B(9 -1/2, S/2)B(9 + 3/2, S/2)- (1 + 29)6(9 + 1/2, S/2) 2 
4 + 89 + (4 + 89)B(9 - 1/2, S/2) 2 ( ' 

We use the standard Taylor series expansion of B(x,y), 

*-^ n m\Y{x + m + 1 

m=0 v ' 

to obtain a simple approximation for the mean stationary frequency of allele a-^: 

M(9,S) = 1 - + ^ + 0(9 2 ), (15) 

valid for 9 ~ S <C 1. This approximation indicates that the difference in expected volatility 
at a site under neutral versus negative selection is of order S, when 9 1. 

When 9 = S = 1, the mean stationary frequency of allele a x assumes the value ^ 
0.58. For ~ 5 3> 1, the mean frequency quickly approaches the asymptotic value 
lim^oo M(9, 9) = , in agreement with our earlier result for an infinite population. 

The results in this section generalize our analysis of an infinite population. For an infinite 
population, we found that the expected relative frequency of codon a± equals \ in the almost 
neutral regime, and it equals in the strong selection regime. In a finite population 

with 9 3> 1, the same results hold. In a finite population with 9 <^ 1, the expected relative 
frequency of the more volatile codon equals \ in the neutral regime, and it equals \ + ^ 
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in the strong selection regime. For any population size, the relative frequency of codon a\ 
depends monotonically on the strength of selection at the amino acid level, a, and it exhibits 
a sharp transition at the critical value a = u. 

It is worth noting that our exact expression (Eq. ITT|) for the mean stationary frequency of 
allele a\ generalizes earlier work by Bulmer [5] on the relative frequency of two synonymous 
codons that experience a direct fitness difference. In the limit of small 9, we find that 

which agrees with Bulmer's result (his Equation 6). In other words, Bulmer 's approximation 
applies only for vanishing small mutation rates (or population sizes). 

We can again use the standard Taylor expansion of the Bessel function to obtain a simple 
expression for the variance in the stationary frequency of allele cti, 

16(3 + 20)(1 + 20)2 ' U7j 



which is a highly accurate approximation for all 0, provided as usual that S is of order or 

1 _ 

4 2' 



smaller. Note that when 1 the variance is approximated by 4 — |, and when ^> 1 the 



variance is of order 

4.1 Inferring Negative Selection in a Finite Population 

Our exact (Eq. ITT|) or approximate (Eq. IT5Jl expressions for the stationary mean frequency 
of codon a\ allow us to determine the minimum number of sites required for codon volatility 
to distinguish reliably between neutral versus negative selection. When sites are modeled 
independently (equivalent to the assumption of linkage equilibrium [32] ) 5 under neutrality 
(ff C «; s = 0) the relative frequency of codon a\ versus codon a<i across a gene of length 
L is binomially distributed with mean | and variance jr. If, on the other hand, the gene 
experiences negative selection (er 3> u; s = u), then the relative frequency of codon a\ is 
binomially distributed with mean M(0, S) and variance M(0, S)[l — M(0, S)]/L. Therefore, 
in order to reliability reject neutrality at about the 95% confidence level, we require 

M{B,S)-± > 2^J (18) 

Using this equation, Fig. El shows the minimum number of sites required to reliably distin- 
guish negative selection from neutrality on the basis of codon volatility, under our simplified 
'genetic code' consisting of three codons. 
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Figure 3: The relationship between the scaled population size, 9 = 2Nu, and the minimum 
number of sites required to distinguish negative selection from neutrality, at the 95% confi- 
dence level. Sites are assumed to be unlinked. It is important to note that the appropriate 
effective population size that determines the value of 9 in practice does not necessarily equal 
the average neutral site heterozygosity (see Sectional). 



Eq. applies when comparing a collection of neutral sites against a collection of sites 
under negative selection. In most situations, however, the selective constraint a will vary 
across the sites of a protein. For example, consider gene X with L + J sites under negative 
selection, compared to gene Y with L neutral sites and J sites under negative selection. 
In this case, the expected frequency of codon a\ in gene Y is (L/2 + JM(9, S))/(L + J). 
Therefore, in order to reliably infer that gene X experiences more negative selection than 
gene Y, at the 95% confidence level we require 



L/2 + JM{9 1 S) L/4 + JM(9,S)[l-M(9,S)] 

M{e ' s) (ITj) — > 2 v (my (19) 

As Eq. indicates, the power to discriminate between two genes is decreased when both 
genes contain many sites, J, under negative selection and only a few sites, L, under different 
selective regimes. Nevertheless, provided J ~ L, the power to discriminate between genes 
X and Y is decreased by ~20% at most (compared to J = 0), and so the minimum number 
of sites required to detect negative selection (Fig. EJ) remains mostly unchanged. 

Although the results in this section were derived for a simplified genetic code, the scaling 
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behavior of these solutions holds for the full genetic code as well - i.e. when comparing 
neutrality to negative selection, for 9 <C 1 the expected difference in volatility per site will 
be of order 9; and for 9 3> 1 the expected difference in volatility can be calculated from the 
infinite population model (Eq. Eland Tabled)). 

4.2 Relaxation towards steady state 

Although Eq. predicts the steady-state relative frequencies of codons a\ and a2 in the 
selected regime (a 3> u), we have not yet discussed how long it takes, on average, to reach 
this steady state. In the case of a very large population, 9 3> 1, we know from the infinite 
population model (Section that the e-fold relaxation time to equilibrium is of order - 
generations. In this section, we demonstrate that the same result applies to the time scale 
of relaxation towards steady state in the regime 9 -C 1. 

As usual, we consider a single site under negative selection. In the regime 9 <C 1, we have 
seen that the steady-state population will spend most of the time in a nearly monomorphic 
state, with a preference (of order 9) for the less volatile codon, a%. Therefore, in order 
to calculate the time scale of relaxation towards steady state, we may simply calculate 
the amount of time required such that, starting with a population fixed for allele ci2, the 
probability of the population remaining fixed for allele a<i has been reduced e-fold. 

Given a population initially fixed for codon 02, there are Nu mutations to codon ai 
generated per generation. Each of these mutations has an effective selective advantage s = u 
over allele 02, and will therefore fix with probability 2s/ (1 — e~ 2Ns ) 1QJ. Hence the rate of 
production of a mutation that will eventually fix is given by 

2Nus , , 

P *» = 73^v7 « «. (2°) 

assuming 9 <C 1. According to this calculation, the mean time until fixation of codon a± is 
of order - generations, which gives the time scale of relaxation to the steady-state codon 
usage in a finite population under negative selection. 



5 About Population Sizes 

As discussed above, the strength of the signal of negative selection depends upon the pa- 
rameter 9 = 2Nu. What is the appropriate value of 9 in practice? 

Unfortunately, this question is far easier asked than answered. Population geneticists 
have long struggled to reconcile estimates of 9 deduced from polymorphism data with direct 
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measurements of N and u across broad taxonomic ranges. The effective population sizes of 
micro-organisms in particular are topics of active debate. Estimates of 9 are usually obtained 
by comparing SNP data at neutral (or presumably neutral) sites against the expected site 
diversity or the expected number of segregating sites under a neutral model jT3j. In a recent 
survey jSOI authors have reported an average value of 9 ~ 0.15 among the prokaryotes 
studied. But estimates of 9 for a microbial species can vary by four orders of magnitude, and 
they depend strongly on assumptions about population structure |3] . To complicate matters 
further, heterogeneity in mutation rates leads to substantial underestimates of 9 [46 . 

Aside from uncertainty in its estimation, the value of 9 deduced from neutral SNP data 
[HO] rnay not be relevant to questions of selection and volatility. Monomorphism observed 
at neutral sites may result from non-neutral processes, such as background selection [2j or 
hitchhiking on periodically sweeping sites [31 . As a result, the variance effective population 
size estimated from SNP data may not be relevant to other aspects of evolution, such as 
substitutions at linked weakly selected sites [T6] . 

One particularly striking example of a discrepancy in the appropriate effective population 
sizes arises from the consideration of mutator phenotypes. Populations of microbial species 
periodically experience a transient increase in the mutation rate, often 10 2 — 10 3 times greater 
than that of a non- mutator strain ^7j. Between 2-20% of bacterial populations isolated in 
the wild at any given time exhibit a mutator phenotype ESI EE] • The mutator phase can 
be induced in several ways. A defective DNA repair gene may arise and sweep to fixation by 
hitchhiking on a positively selected mutation . The entire population then experiences an 
elevated mutation rate until a non- mutator allele sweeps and replaces the mutator |331 IP2]. 
A second, perhaps more common mechanism is stress-induced mutagenesis; natural isolates 
of E. coli often experience an increase in their mutation rate in response to stress jl]. As 
a result of these and other observations, researchers have argued that bacterial populations 
evolve primarily by periodic acquisition of mutator phenotypes followed by adaptive sweeps 
and subsequent loss of the mutator [T3[l2llHE]. As we shall see, the effect of this process on 
synonymous codon usage is dramatic: the expected site diversity is driven by the value of 9 
in the wildtype regime (9 W = 2Nu w ), but the pattern of synonymous codon usage at a site 
under negative selection is driven by the value of 9 in the mutator regime (9 m = 2Nu m 3> 9 W ). 

As a simple example of this phenomenon, we have simulated a Fisher- Wright model of 
a single locus in a population of constant size N = 1000. The simulated site is subject to 
recurrent mutation between "alleles" a\ and a 2 at wildtype rate u w = 10~ 5 . As in Section 
HI the alleles ai and a 2 differ in fitness by s, where s equals the mutation rate. Periodically, 
we model the fixation of a mutator allele (or, equivalently, the stress-induced mutagenesis 
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across the entire population) by exogenously increasing the mutation rate to u m = 10 3 x u w 
for 100 generations; thereafter we (artificially) enforce a selective sweep at the site, followed 
by reversion to the wildtype mutation rate. Overall, the population experiences the mutator 
regime for 5% of the time, consistent with observed frequencies of mutator phenotypes in 
the wild [T7J EHl I2H] • According to our simulations, the average site diversity, 2x(l — x), at 
a randomly chosen time equals 0.028, which is close to its expected value assuming that 9 is 
given by 9 W : E[2x(l — x)\ = 9 W = 0.02. But the average frequency of allele a\ equals 0.611, 
which is close to its expectation assuming that 9 is given by 9 m \ K[x] = M(9 m , 9 m ) = 0.616 
(Eq. Hip . In other words, the average frequency of the less volatile codon a\ is dominated 
by the mutator periods, but the average site heterozygosity (and any estimate of 9 based on 
it) is dominated by the non-mutator periods. 

There is a simple, intuitive explanation for this result. The average heterozygosity at the 
site is low at virtually all times (except during the brief mutator periods) because selective 
sweeps cause monomorphism, followed by long periods of low 9. Therefore, the effective 9 
for SNP diversity is small, i.e. close to Nu w . But the site converges quickly towards the 
less volatile codon during the mutator periods, since the rate of convergence is determined 
by s = u m . And the site is essentially frozen during the non-mutator periods, since the 
decay rate of volatility is only u w . Therefore the expected frequency of a\ at a random time 
is primarily determined by the frequency reached during the mutator regime. As is clear 
from this explanation, the expected frequency of codon a\ will, in general, depend upon 
the stochastic scheduling of mutator periods. For example, the site will converge towards 
M(9 m ,9 m ) provided the population experiences at least one mutator phase of duration of 
order l/u m generations, within every l/u w generations. In fact, even if the mutator phases 
are very brief and infrequent, the average frequency of allele a\ can greatly exceed the value 
predicted by 9 estimated from the average site heterozygosity. 

Although the simple model used in this section does not describe any but the most 
phenomenological features of mutator alleles, it does reveal an important general observation: 
the value of 9 estimated from neutral SNP data does not in general equal the effective value 
of 9 that determines synonymous codon usage at a site under negative selection. This result 
is of utmost importance to any discussion of the relationship between 9 and the power of 
volatility to detect negative selection. 
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6 Positive selection 



In the sections above, we have considered selection that opposes a change to the amino acid 
at a site. This type of negative selection induces a bias towards the less volatile codons for 
the favored amino acid at a site. However, selection sometimes favors a change in the amino 
acid at a particular site. In such situations, as we will demonstrate, a site is more likely to 
be occupied by a codon of greater than average volatility. 

A variety of mechanisms are known to cause positive selection. Frequency dependence 
often induces diversifying selection at a site, whereas an exogenous change in the environment 
can induce directional selection for a new, specific amino acid. We do not here model all 
of the various types of positive selection, but rather focus on the essential aspect shared 
by these mechanisms. We analyze the dynamics at a site that has, for a period of time, 
experienced negative selection for amino acid A, and that subsequently experiences negative 
selection for different amino acid, B (for whatever reason). We refer to the change in the 
selective regime as a positive selection event. 

Prior to the onset of positive selection, amino acid A is assigned fitness 1 and all other 
amino acids fitness 1 — a; subsequently, amino acid B is assigned fitness 1 and all others 
fitness I — a. We assume that Na 3> 1 (otherwise, the site is effectively neutral at the amino 
acid level) and that a u (otherwise, the expected codon frequencies are uniform). Once 
the population shifts to the new amino acid B, it is clear that the site will more likely contain 
a codon that is more volatile than the average 5-codon, because it has just arisen through a 
nonsynonymous mutation. Since B is now favored, negative selection subsequently operates 
to reduce the volatility at the site. However, this process takes time. Thus, for some time 
after the positive selection event, there is a bias toward elevated volatility at the site, which 
gradually decays. In this section, we analyze this process. 

Analagously to previous sections, we initially consider a simplified genetic code consisting 
of four codons, ai, a 2 , 61, and b 2 , the first two of which encode amino acid A, and the latter 
two amino acid B. Mutations can only occur between codons a± and a 2 , a 2 and 62, and b 2 
and bi, creating the mutation structure 

ai ^ a 2 <=± b 2 ^ 61. 

In this simplified genetic code, codons a 2 and b 2 are the more volatile codons for their 
respective amino acids. 

After the change in selection from amino acid A to B, a mutation to codon b 2 that 
survives stochastic drift will eventually arise. Thus, at least initially, the more volatile 
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codon 6 2 is more prevalent than the less volatile codon b\. During this period, we can detect 
the signature of the positively selected sweep because of the elevated volatility at the site. 
However, negative selection for amino acid B will eventually favor codon b\. Therefore, the 
volatility signature of the positive selection event will be present provided that the time scale 
of decay toward codon b\ is longer than the interval since the positive selection event. 

Fortunately, the time scale of decay towards b\ is quite long. For 6 3> 1, we can use the 
infinite population model to find this time scale. As discussed above, the time required to 
reduce the volatility e-fold is of order -. For 6 <C 1, we must use a finite population size 
calculation. In this regime, the population is nearly monomorphic at almost all times. Fol- 
lowing the selective sweep, the site will be monomorphic for 62 with almost unit probability. 
We are interested in the duration of time required such that probability of being monomor- 
phic for b 2 (as opposed to b\) has been reduced e-fold. The probability of switching between 
62 and bi, however, is of order u per unit time (even before 6 2 has finished outcompeting a 2 ), 
according to Eq. 1201 Thus, the time scale of decay in a finite population is also -. 

According to this analysis, a selective sweep will result in the presence of a more volatile 
codon for of order - generations - a very long time indeed. (In the case of E. coli, for 
example, - generations is nearly 100, 000 years, given «Ri5x 10~ 10 and generation time ~ 
20 minutes. The generation length and resulting time scale for E. coli in the wild may be 
much longer yet JSj-) Equivalently, repeated sweeps for amino acid changes at a site will 
result in the presence of more volatile codons at almost all times, provided that new sweeps 
occur more often than every - generations. 

6.1 Inferring Positive Selection 

The above analysis for a simplified genetic system generalizes in an obvious way to the 
real genetic code. After a positive selection event at a site, the population switches from 
a codon for amino acid A to a codon for amino acid B. The expected volatility of the 
new codon is greater than the average volatility of I?-codons, because the new codon has 
just arisen through a nonsynonymous mutation. To be more precise, if the population is 
monomorphic for a random non-I? codon before the selective sweep, then after the sweep 
occurs the expected relative frequencies of the 5-codons are given, approximately, by their 
relative volatilities. Subsequent to the selective sweep, the increased volatility at the site 
will decay on a time scale of order of - generations. 

There is a critical distinction between the volatility signature of positive selection versus 
that of negative selection. The depressed volatility at a site under negative selection is 
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caused by a mutation-selection-drift balance. When the effective population size is small, 
a large number of sites are required to distinguish negative selection from neutrality. By 
contrast, the volatility signature of positive selection is not an equilibrium property, and 
it is not sensitive to population size. Regardless of 9, the probability of sampling a more 
volatile codon is significantly elevated immediately after a selective sweep at a site, and this 
probability decays only at rate u. 

As we have seen, a gene that contains many sites under positive selection will exhibit a 
greater volatility (controlling for its amino acid composition) than a gene under neutral or, 
especially, negative selection. How many positively selected sites are required in order to 
detect a reliable signal? In the case of Leucine, for example, the expected volatility of a site 
that has recently experienced a positively selected sweep is approximately 0.660 ±0.072 (one 
standard deviation), whereas a neutral Leucine site has expected volatility 0.646 ± 0.073, 
and a Leucine site under negative selection has expected volatility 0.632 ± 0.0070 (see Table 
[TJ). Therefore, the volatility of about 100 Leucine sites under positive selection will be 
significantly greater (at the 95% confidence level) than that of 100 neutral sites. Similarly, 
the volatility of about 25 positively selected Leucine sites will be significantly greater than 
that of 25 negatively selected sites. Similar results hold for Serine and Arginine; Glycine is 
less informative. 

It is worth noting that the elevated volatility for a positively selected Serine site will 
decay even more slowly than for other amino acids, because the highly volatile codons ACC 
and AGT are not connected by synonymous mutations to other serine codons. 

7 Discussion 

7.1 Codon volatility versus comparative sequence analysis 

Selection pressures on proteins are usually estimated by comparing homologous nucleotide 
sequences jHZj- Orthologous genes are identified in different organisms and sequenced; their 
sequences are then aligned, and the changes that have accumulated since divergence are 
used to infer the selection pressures that have been acting [T%] . When available, sequence 
variation sampled from individuals within a species can be compared with variation across 
species to produce an elegant test for adaptive evolution at a locus (33 H2j- in addition, 
there are a variety of statistical tests designed to detect a departure from neutrality in the 
site frequency spectrum sampled within a single species (see ref. and references therein). 
In many cases, the complete distribution of these statistics under the neutral null model are 
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difficult to derive, but they have been studied through computer simulation jHj. 

Techniques for estimating selective constraints via sequence comparison are typically ap- 
plied, independently, to one or several genes at a time. When extensive intra- or inter-specific 
sequence data are available at a locus of interest, such techniques have proven enormously 
useful for measuring selection, and it is unlikely that they will be significantly improved by 
incorporating information about synonymous codon usage. But the accurate estimation of 
selective constraints requires a large number (approximately six or more j2]) of orthologous 
sequences for each gene of interest. At the genome-wide scale, comparative data (i.e. orthol- 
ogous gene sequences) will not be available for all genes, and methods to estimate selective 
constraints based on sequence comparison will often be inapplicable. Furthermore, the genes 
under positive selection are often of particular interest, but such genes are even less likely 
to have identifiable orthologs in related species due to their rapid sequence divergence [39J. 
Even in the lineage of the Saccharomyces genus, which is currently the best-case scenario 
for comparative genomics, the genomes of four species have been fully sequenced and only 
two-thirds of the genes in S. cerevisiae have unambiguously identifiable orthologs in related 
species jlOj- Unlike comparative techniques, the analysis of synonymous codon usage offers 
a computational tool to screen for selection pressures on all genes in a sequenced genome. 
Genome-wide screens based on analyzing synonymous codon usage may prove useful in iden- 
tifying important classes of genes under strong selection, such as the antigens of pathogens 

Unlike most comparative statistics that test for a departure from neutrality, estimates of 
selection based on bootstrapped volatility scores are not 'estimators' in a rigorous statis- 
tical sense - i.e. statistics whose sampling properties can be derived from a null model, and 
which can be used in likelihood ratio tests of a null hypothesis |S3 IE] • Given the expected 
relative frequencies of codons that we have derived for each of the three regimes (neutral, neg- 
ative, and positive selection; Table Q), it may yet be possible to design maximum-likelihood 
methods that estimate the number of sites of a gene in each regime. This approach will be 
complicated, however, by other sources of codon bias; see below. 

Aside from the different situations in which they are applicable, and differences in the 
rigor of their derivation, estimates of selection based on codon volatility differ in a fun- 
damental way from most estimates based on sequence comparison. Homologous sequence 
comparison between species is often used to assess, either by maximum likelihood ^H] or 
maximum parsimony |29j . the rates of synonymous and non-synonymous substitutions in 
a coding sequence. The ratio of these rates, dN/dS, is used as a measure of the selective 
constraints that have been acting on a protein since the divergence of the species being 
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compared. An alternative approach, based on a Poisson Random Field (PRF) model of 
mutation frequencies, uses the site frequency spectrum at a locus sampled from individuals 
within a species to deduce the average selective pressure for or against amino acid changes 
in a gene (Poisson Random Field models can also be used to construct likelihood ratio 
tests of departure from neutrality [0].) Like most comparative methods, however, both of 
these models typically assume that all the sites within a gene experience the same selective 
pressure against amino acid substitutions (but see the site-by-site likelihood tests of Yang 
et al. jH]). Under the PRF theory, for example, authors have estimated a very small "av- 
erage" selection pressure against amino acid changes in E. coli genes: a ~ 10~ 8 [20]. This 
value does not represent the arithmetic average of the true a values across sites, but rather 
the best-fit constant value of a that would make the PRF model consistent with observed 
sequence variation at polymorphic sites. 

When evolutionary rates are estimated at individual residues |54[ 153] , however, we find 
great variation across sites. Moreover, direct experimental measurements of the fitness con- 
sequences of amino acid substitutions in micro-organisms reveals huge variation in selection 
pressures across the residues of an individual protein: a substantial proportion of substi- 
tutions are lethal, and a substantial proportion have undetectable effect I2UJ EH EE El- 
Therefore, it is not entirely clear how best to interpret the value of a ~ 10~ 8 estimated for 
E. coli genes using the PRF model, which assumes constant pressure at each residue. 

Compared to dN/dS or a estimated by the PRF model, codon volatility quantifies selec- 
tion pressures in a very different, coarser manner. As discussed above, volatility essentially 
measures the number of sites in a gene that experience negative (er ^> u) versus neutral 
(a <C u) versus positive selection. Given that, in reality, many amino acid changes to a 
protein sequence are lethal while other changes have no effect whatsoever, it is reasonable 
and meaningful to estimate the number of sites in the selected versus neutral regimes. But 
volatility is not sensitive to variation in selective pressures within either of these regimes. 
Hence, the volatility measure is in some ways a coarser description of selective pressure than 
PRF or dN/dS. One should not necessarily expect that volatility will correlate very strongly 
with dN/dS or PRF estimates, because the latter measures represent some sort of average a 
over the entire gene, and are thus presumably sensitive to the full range of variation in a. A 
measure based on codon volatility is therefore different from and complementary to dN/dS 
or PRF estimates of the selective constraints on a genes. 

As an aside, it is important to note that the most common model used to estimate dN/dS 
from divergent nucleotide sequences ^H] does not itself reflect the relationship between se- 
lection and volatility. dN/dS is often estimated by fitting maximum likelihood parameters 
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to a simplified Markov-chain model of sequence evolution that ignores population variability 
[T%] . Models that ignore population variability are perfectly reasonable approximations when 
comparing the sequences of relatively divergent lineages [TH]; but such models fail to detect 
the effect of amino-acid selection on synonymous codon usage. Such models consider only 
a single sequence that is assumed to represent the dominant genotype in the population at 
any time. Mutation and selection are modeled simultaneously by adjusting the transition 
rates between codon states in the sequence ^B]- As a result, in equilibrium, the number of 
transitions into a state per unit time must equal the number of transitions out of that state; 
and so equilibrium synonymous codon usage does not depend upon the strength of selection 
in these simplified models (In fact, under the standard assumption of time-reversibility, 
such models require as parameters the specification of the equilibrium codon usage [IB], and 
therefore they clearly cannot be used to predict equilibrium codon usage.) Simulations of 
sequence evolution based on these simplified models (such as the non-frequency-dependent 
simulations of Zhang jHE]) will thus fail to detect the relationship between dN/dS and volatil- 
ity, whereas more detailed simulations that account for population variability (such as the 
frequency-dependent simulations of Zhang jHEl, as well as the non-frequency-dependent sim- 
ulations in this work) will properly reflect the relationship between selection and volatility, 
as predicted by Fisher- Wright models of a replicating population. 

7.2 Other sources of codon bias 

Although it came as a surprise to early neutral theorists j2H|, it is now clear that there are 
several processes that result in unequal usage of synonymous codons. Many processes that 
cause codon bias in microorganisms, such as biased nucleotide content or mutation rates, 
can apply roughly equally to all the genes in a genome. To the extent that other sources of 
codon bias apply equally across a genome, it is straightforward to control for these biases 
when comparing the volatilities of genes within a genome to estimate selection pressures on 
proteins [US!- 

To the extent that other sources of codon bias differ from gene to gene within a genome, 
they may (if not properly controlled for) introduce errors into estimates of the relative 
selection pressures on proteins inferred from codon volatility [35] • Similarly, selection on 
synonymous codons - particularly selection that varies from gene to gene - will likewise 
introduce errors into estimates of selection on protein sequences obtained by comparative 
techniques such as dN/dS [I3JI22I- 

As we have argued, some of the variation in synonymous codon usage across a genome is 
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caused by the variation in selection pressures on protein sequences. Throughout our analysis, 
we have specifically ignored any other source of codon biases so as to derive the effects of 
selection at the amino acid level on codon usage. But in many organisms other processes 
that vary between genes are certainly operating as well. For instance, it is known that the 
transition/transversion mutation bias can vary across a genome. Results on S. cerevisiae, 
whose genome exhibits marked variation in the tr/tv bias [JD], suggest that this source of 
variable codon bias will not distort estimates of selection based on volatility: whether or 
not one accounts for the variation in the tr/tv bias across the genome of S. cerevisiae one 
obtains virtually the same rankings of gene volatilities (r > 0.99) [3D] . 

Aside from mutational biases, there are other sources of codon bias that vary from gene 
to gene in some organisms. In the yeast S. cerevisiae, researchers have observed that synony- 
mous codon usage, measured by the Codon Adaptation Index (CAI) [UI|, is correlated with a 
gene's expression level in laboratory conditions j^j. This correlation is thought to be caused 
by selection for translational efficiency and/or accuracy: a codon corresponding to a more 
abundant tRNA is expected to be translated more quickly (due to the higher probability per 
unit time that the appropriate tRNA will "find" the codon) and more accurately (since the 
correct tRNA will likely have the greatest chance of pairing if it is the most abundant). 

Considering this alternative source of biased codon usage, two questions should be asked: 
do other sources of codon bias distort estimates of selection based on volatility, and how can 
we control for these confounding factors? Unfortunately we do not have a truly satisfactory 
answer for either of these questions, but the discussion below may shed some light on the 
issues involved. 

With regards to the first question, we note that the degree to which other sources of codon 
bias may distort volatility-based estimates of selection will strongly depend on the organism 
being studied. Some species (such as humans) exhibit a much weaker correspondence between 
codon frequencies and tRNA abundances than others species; so clearly other sources of 
codon bias will affect volatility values differently in different species. In a species with a 
strong correspondence between codon usage and tRNA abundances, the extent to which 
variation in this source of codon bias across the genome affects volatility will depend on 
whether volatile codons are (un)preferred: if there is no correlation between volatility and 
tRNA abundances, then the other sources of codon bias will only introduce random error 
into volatility estimates, making them less powerful but still reliable. If instead the preferred 
codons tend to have either high or low volatility, then this effect could introduce systematic 
errors into volatility estimates. In the latter case, in order to quantify how much codon usage 
bias is caused by volatility as opposed to other factors, one would require a method to predict 
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for individual genes the amount of codon bias due to these other factors. Unfortunately we 
are far from having the necessary level of predictive power for other sources of codon bias 
in any organism. Although gene expression level is somewhat predictive of codon bias, 
expression levels do not explain most of the variation in codon bias in any genome studied 
thus far [U|U]. Until the various sources of biased codon usage can be reliably disentangled, we 
cannot reliably quantify the effects of these biases on volatility-based estimates of selection. 

The second question, how to control for other sources of biased codon usage, is also 
difficult to answer at present. As discussed above, an appropriate method to control for 
other sources of bias would require disentangling the various sources of codon bias in a 
predictive manner for each gene. While this degree of precision is not currently possible, 
one approach is to assume that the codon bias measured by CAI is entirely independent 
of volatility, and then control for CAI using partial correlations. For several reasons, we 
expect this approach to be conservative, as we illustrate using the yeast S. cerevisiae (we 
use this species as an example because it shows a strong preference for codons that match 
abundant tRNAs, and because we have reliable dN/dS values for almost two-thirds of its 
genes, calculated from multiple alignments of closely related species [22] )• First, we note 
that dN/dS is itself strongly correlated with both CAI and gene expression levels [37], and 
it is therefore impossible to construct any measure of selective constraint that agrees with 
dN/dS and is not itself strongly correlated with CAI and expression levels in yeast. Second, 
it is possible that the codon bias measured by CAI is in part caused by volatility (i.e. 
highly expressed genes tend to experience stronger purifying selection and therefore exhibit 
unusual codon usage biased towards lower volatility), and so controlling for CAI would 
be inappropriate. Despite several biological hypotheses, there is no accepted mechanistic 
explanation for the correlation between CAI and dN/dS in yeast [HUE], and so it is unclear 
whether controlling for CAI is appropriate. Nevertheless, we have tested the correlation 
between volatility and dN/dS while controlling for CAI using a partial correlation. We find 
that even when controlling for CAI (or expression levels), there remains a highly significant 
correlation between volatility and dN/dS in yeast (p < 10~ 34 [ID!)- Therefore, even under 
this conservative test, estimates of selection obtained by volatility are still consistent with 
estimates obtained by homologous sequence comparison. We interpret this result as evidence 
that volatility is measuring selective constraints above and beyond any signal inherent in CAI. 

Indeed, there is a great deal of empirical evidence indicating that the volatility of a gene is 
correlated with the selective constraint it experiences. Aside from highly significant correla- 
tions between volatility and dN/dS in bacterial species and yeast [HH], volatility also reflects 
a range of other features known to correlate with selection on proteins. In S. cerevisiae, 
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for example, volatility is strongly correlated with the essentiality of genes, the number of 
their protein-protein interactions, and the degree to which they are preserved throughout the 
eukaryotic kingdom Furthermore, volatility is significantly elevated among the known 
antigens and surface proteins (which experience positive selection) in the pathogens Mycobac- 
terium tuberculosis, Plasmodium falciparum, and Influenza A virus [3H1 EH] • And volatility is 
significantly depressed in the genes essential for growth of M. tuberculosis, as well as in the 
genes conserved between related Mycobacterium species Therefore, despite potential 

confounding sources of codon bias that cannot at present be controlled for with appropriate 
accuracy, in practice volatility-based methods produce estimates of selection pressures that 
are consistent with our understanding of protein evolution over a diverse range of taxa. 

Finally, we note that there may be direct selection on synonymous codons in order to 
evade mistranslation Since mistranslation is far more likely to occur between a codon 
and an anticodon that differ by a single nucleotide, the definition of volatility (Eq. ^) is 
appropriate for measuring the selective pressure for or against mistranslation. The strength 
of this type of selection on synonymous codons would depend upon the mis-incorporation 
rate of tRNA (which is far higher than the mutation rate) and the detriment of mistranslation 
(which is likely far lower than that of most mis-sense mutations). It is difficult at present to 
measure the molecular parameters of tRNA mis- incorporation and its fitness effects; so it is 
unclear how much of a volatility signal arises from mistranslation avoidance versus standard 
selection on mis-sense mutations. However strong this signal, though, the volatility of a gene 
would still reflect the degree to which there is selection to conserve, or not to conserve, the 
(translated) protein sequence. 
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